Figure 1 | A whole new world: Genie is capable of converting a variety of different prompts into
interactive, playable environments that can be easily created, stepped into, and explored. This is 
made possible via a latent action interface, learned fully unsupervised from Internet videos. On the 
right we see a few generated steps for taking two latent actions. See more examples on our website. 


We introduce Genie, the first generative interactive environment trained in an unsupervised manner 
from unlabelled Internet videos. The model can be prompted to generate an endless variety of action- 
controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 
11B parameters, Genie can be considered a foundation world model. It comprises a spatiotemporal
video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. 
Genie enables users to act in the generated environments on a frame-by-frame basis despite training 
without any ground-truth action labels or other domain-specific requirements typically found in the 
world model literature. Further, the resulting learned latent action space facilitates training agents to
imitate behaviors from unseen videos, opening the path for training generalist agents of the future. 
1. Introduction


The last few years have seen an emergence of 
generative AI, with models capable of generat- 
ing novel and creative content. Driven by break- 
throughs in architectures such as transformers 
(Vaswani et al., 2017), advances in hardware, and 
a recent focus on scaling models and datasets, we 
can now generate coherent, conversational lan- 
guage (Brown et al., 2020; Radford et al., 2018, 
2019), as well as crisp and aesthetically pleas- 
ing images from a text prompt (Ramesh et al., 
2021, 2022; Rombach et al., 2022; Saharia et al., 
2022). Early signs indicate video generation will 
be yet another frontier, with recent results sug- 
gesting that such models may also benefit from 
scale (Blattmann et al., 2023a; Esser et al., 2023; 
Ho et al., 2022a; Hong et al., 2023). Still, there 
remains a gulf between the level of interactions 
and engagement of video generative models and 
language tools such as ChatGPT, let alone more 
immersive experiences. 


What if, given a large corpus of videos from 
the Internet, we could not only train models ca- 
pable of generating novel images or videos, but 
entire interactive experiences? We propose gen- 
erative interactive environments, a new paradigm 
for generative AI whereby interactive environ- 
ments can be generated from a single text or 
image prompt. Our approach, Genie, is trained 
from a large dataset of over 200,000 hours of 
publicly available Internet gaming videos and, de- 
spite training without action or text annotations, 
is controllable on a frame-by-frame basis via a 
learned latent action space (see Table 1 for a com- 
parison to other approaches). At 11B parameters, 
Genie exhibits properties typically seen in foun- 
dation models—it can take an unseen image as 
a prompt, making it possible to create and play
entirely imagined virtual worlds (e.g. Figure 2).


Genie builds on ideas from state-of-the-art 
video generation models (Gupta et al., 2023; Vil- 
legas et al., 2023), with a core design choice be- 
ing spatiotemporal (ST) transformers (Xu et al., 
2020) which are used in all of our model com- 
ponents. Genie utilizes a novel video tokenizer, 
and extracts latent actions via a causal action 
model. Both the video tokens and latent actions 
are passed to a dynamics model, which autore-
gressively predicts the next frame using MaskGIT
(Chang et al., 2022). We provide a rigorous scal-
ing analysis of our architecture with respect to
both batch and model size, which we vary from
40M to 2.7B parameters. The results show that
our architecture scales gracefully with additional
computational resources, leading to a final 11B
parameter model. We train Genie on a filtered set
of 30,000 hours of Internet gameplay videos from
hundreds of 2D platformer games, producing a
foundation world model for this setting.


Figure 2 | Diverse trajectories: Genie is a gen-
erative model that can be used as an interactive
environment. The model can be prompted in var-
ious ways, either with a generated image (top) or
a hand-drawn sketch (bottom). At each time step,
the model takes a user-provided latent action to
generate the next frame, producing trajectories
with interesting and diverse character actions.


To demonstrate the generality of our approach, 
we also train a separate model on action-free 
robot videos from the RT1 dataset (Brohan et al., 
2023), learning a generative environment with 
consistent latent actions. Finally, we show that 
latent actions learned from Internet videos can be 
used for inferring policies from unseen action-free 
videos of simulated reinforcement learning (RL) 
environments, indicating that Genie may hold the 
key to unlocking unlimited data for training the 
next generation of generalist agents (Bauer et al., 
2023; Clune, 2019; Open Ended Learning Team
et al., 2021; Reed et al., 2022). 


Table 1 | A new class of generative model: Genie 
is a novel video and world model that is control- 
lable on a frame-by-frame basis, which requires 
only video data at train time. 

Model Class     Training Data      Controllability
World Models    Video + Actions    Frame-level
Video Models    Video + Text       Video-level
Genie           Video              Frame-level


2. Methodology 


Genie is a generative interactive environment 
trained from video-only data. In this section we 
begin with preliminaries before explaining the 
main components of our model. 


Several components in the Genie architecture 
are based on the Vision Transformer (ViT) (Doso- 
vitskiy et al., 2021; Vaswani et al., 2017). No- 
tably, the quadratic memory cost of transformers 
poses challenges for videos, which can contain 
up to $O(10^4)$ tokens. We thus adopt a memory
efficient ST-transformer architecture (inspired by 
Xu et al. (2020), see Figure 4) across all model 
components, balancing model capacity with com- 
putational constraints. 


Unlike a traditional transformer where every 
token attends to all others, an ST-transformer 
contains L spatiotemporal blocks with interleaved 
spatial and temporal attention layers, followed 
by a feed-forward layer (FFW) as standard at- 
tention blocks. The self-attention in the spatial 
layer attends over the $1 \times H \times W$ tokens within
each time step, and in the temporal layer attends
over $T \times 1 \times 1$ tokens across the T time steps.
Similar to sequence transformers, the temporal 
layer assumes a causal structure with a causal 
mask. Crucially, the dominating factor of compu- 
tation complexity (i.e. the spatial attention layer) 
in our architecture scales linearly with the num- 
ber of frames rather than quadratically, making 
it much more efficient for video generation with 
consistent dynamics over extended interactions. 
Further, note that in the ST block, we include only
one FFW after both spatial and temporal compo-
nents, omitting the post-spatial FFW to allow for
scaling up other components of the model, which 
we observe to improve results significantly. 
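
To make the scaling argument concrete, the following minimal sketch counts query-key pairs for full space-time attention and for one ST block. The token grid (H, W) is a hypothetical value chosen for illustration and is not a configuration reported here, and the causal mask is ignored because it only removes pairs.

```python
from typing import Tuple


def full_attention_pairs(T: int, H: int, W: int) -> int:
    """Query-key pairs when all T*H*W tokens attend to each other."""
    n = T * H * W
    return n * n


def st_attention_pairs(T: int, H: int, W: int) -> Tuple[int, int]:
    """Query-key pairs in one ST block, returned as (spatial, temporal).

    Spatial: T independent attention maps over H*W tokens each.
    Temporal: H*W independent attention maps over T tokens each.
    """
    spatial = T * (H * W) ** 2
    temporal = (H * W) * T ** 2
    return spatial, temporal


# Hypothetical token grid, used only for illustration.
H, W = 20, 30
for T in (8, 16, 32):
    spatial, temporal = st_attention_pairs(T, H, W)
    print(T, spatial, temporal, full_attention_pairs(T, H, W))
```

Under these assumptions, doubling T doubles the spatial term, which dominates while H·W is much larger than T, whereas the full-attention count grows fourfold.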


2.1. Model Components 


As shown in Figure 3, our model contains three 
key components: 1) a latent action model that 
infers the latent action a between each pair of
frames, 2) a video tokenizer that converts
raw video frames into discrete tokens z, and 3) a
dynamics model that, given a latent action and 
past frame tokens, predicts the next frame of the 
video. The model is trained in two phases follow- 
ing a standard autoregressive video generation 
pipeline: we train the video tokenizer first, which 
is used for the dynamics model. We then co-train 
the latent action model (directly from pixels) and 
the dynamics model (on video tokens). 


Latent Action Model (LAM) To achieve con- 
trollable video generation, we condition each fu- 
ture frame prediction on the action taken at the 
previous frame. However, such action labels are 
rarely available in videos from the Internet and ac- 
tion annotation can be costly to obtain. Instead, 
we learn latent actions in a fully unsupervised 
manner (see Figure 5). 


First, an encoder takes as inputs all previous
frames $x_{1:t} = (x_1, \cdots, x_t)$ as well as the next frame
$x_{t+1}$, and outputs a corresponding set of contin-
uous latent actions $\tilde{a}_{1:t} = (\tilde{a}_1, \cdots, \tilde{a}_t)$. A decoder
then takes all previous frames and latent actions
as input and predicts the next frame $\hat{x}_{t+1}$.


To train the model, we leverage a VQ-VAE- 
based objective (van den Oord et al., 2017), 
which enables us to limit the number of predicted 
actions to a small discrete set of codes. We limit 
the vocabulary size |A| of the VQ codebook, i.e. 
the maximum number of possible latent actions, 
to a small value to permit human playability and 
further enforce controllability (we use |A| = 8 in 
our experiments). As the decoder only has access
to the history and latent action, $\tilde{a}_t$ should encode
the most meaningful changes between the past 
and the future for the decoder to successfully 
reconstruct the future frame. Note that this de-
coder exists only to give the LAM training signal.
In fact, apart from the VQ codebook, the entire
LAM is discarded at inference time and replaced
with actions from the user.


Figure 3 | Genie model training: Genie takes in T frames of video as input, tokenizes them into
discrete tokens $z$ via the video tokenizer, and infers the latent actions $\tilde{a}$ between each frame with the
latent action model. Both are then passed to the dynamics model to generate predictions for the next
frames in an iterative manner.


Figure 4 | ST-transformer architecture. The ar-
chitecture is composed of L spatiotemporal blocks,
each containing a spatial layer, temporal layer
and feed-forward layer. Each color represents a
single self-attention map, with the spatial layer
attending over the $H \times W$ tokens from within a
single time step, and temporal the same token
from across the T time steps.


Figure 5 | Latent action model: learns actions $a_t$
unsupervised from unlabelled video frames.


We utilize our ST-transformer architecture for 
the latent action model. The causal mask in the 
temporal layer allows us to take the entire video 
$x_{1:T}$ as input and generate all latent actions be-
tween each frame $\tilde{a}_{1:T-1}$.
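
As an illustration of how a continuous latent action can be restricted to a small discrete set, the sketch below performs the nearest-neighbour codebook lookup that is standard in VQ-VAE. It assumes 8 codes of embedding size 32, matching the values reported for the latent action model; the random codebook, the helper name `quantize`, and the omission of the training losses and gradient handling are simplifications for illustration only.

```python
import random
from typing import List, Tuple


def quantize(latent: List[float], codebook: List[List[float]]) -> Tuple[int, List[float]]:
    """Return (index, code) of the codebook entry nearest to `latent`
    under squared Euclidean distance."""
    def sqdist(u: List[float], v: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(u, v))

    idx = min(range(len(codebook)), key=lambda k: sqdist(latent, codebook[k]))
    return idx, codebook[idx]


rng = random.Random(0)
# Hypothetical codebook: 8 latent actions, each a 32-dimensional embedding.
codebook = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(8)]

# A continuous encoder output that lies very close to code 5.
continuous_action = [c + 0.01 for c in codebook[5]]
idx, code = quantize(continuous_action, codebook)
print(idx)
```

The returned index plays the role of the discrete action a user selects at inference time, and the returned code is the embedding passed on to the rest of the model.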


Video Tokenizer Following prior work (Gupta 
et al., 2023; Villegas et al., 2023; Yan et al., 2023), 
we compress videos into discrete tokens to re- 
duce dimensionality and enable higher quality 
video generation (see Figure 6). We again make 
use of VQ-VAE, which takes in T frames of video
$x_{1:T} = (x_1, x_2, \cdots, x_T) \in \mathbb{R}^{T \times H \times W \times C}$ as input, gen-
erating discrete representations for each frame
$z_{1:T} = (z_1, z_2, \cdots, z_T) \in \mathbb{I}^{T \times D}$, where D is the size
of the discrete latent space. The tokenizer is
trained using a standard VQ-VAE objective over
the entire video sequence.
Figure 6 | Video tokenizer: a VQ-VAE with ST-
transformer.


Unlike prior works that focus on spatial-only 
compression in the tokenization phase (Gupta 
et al., 2023; Hong et al., 2022; Wu et al., 2022), 
we utilize the ST-transformer in both the encoder 
and decoder to incorporate temporal dynamics 
in the encodings, which improves the video gen-
eration quality. By the causal nature of the ST-
transformer, each discrete encoding $z_t$ contains
information from all previously seen frames of the
video $x_{1:t}$. Phenaki (Villegas et al., 2023) also uses
a temporal-aware tokenizer, C-ViViT, but this ar- 
chitecture is compute intensive, as the cost grows 
quadratically with the number of frames—in com- 
parison, our ST-transformer based tokenizer (ST- 
ViViT) is much more compute efficient with the 
dominating factor in its cost increasing linearly 
with the number of frames.


Figure 7 | Dynamics model: takes in video to-
kens and action embeddings, and predicts future
masked video tokens. 


Dynamics Model The dynamics model is a 
decoder-only MaskGIT (Chang et al., 2022) trans-
former (Figure 7). At each time step $t \in [1, T]$, it
takes in the tokenized video $z_{1:t-1}$ and stopgrad
latent actions $\tilde{a}_{1:t-1}$ and predicts the next frame
tokens $\hat{z}_t$. We again utilize an ST-transformer,
whose causal structure enables us to use tokens
from all $(T - 1)$ frames $z_{1:T-1}$ and latent actions
$\tilde{a}_{1:T-1}$ as input, and generate predictions for all
next frames $\hat{z}_{2:T}$. The model is trained with a
cross-entropy loss between the predicted tokens
$\hat{z}_{2:T}$ and ground-truth tokens $z_{2:T}$. At train time
we randomly mask the input tokens $z_{2:T-1}$ ac-
cording to a Bernoulli distribution masking rate
sampled uniformly between 0.5 and 1. Note that
a common practice for training world-models, in- 
cluding transformer-based models, is to concate- 
nate the action at time t to the corresponding 
frame (Micheli et al., 2023; Robine et al., 2023). 
However, we found that treating the latent ac- 
tions as additive embeddings for both the latent 
action and dynamics models helped to improve 
the controllability of the generations. 
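
The train-time masking described above can be sketched as follows. This is a minimal illustration rather than the training code: the sentinel `MASK_ID`, the helper names, and the flat token list are assumptions, and only the sampling scheme (a rate drawn uniformly from [0.5, 1], then an independent Bernoulli draw per token) follows the text.

```python
import random
from typing import List

MASK_ID = -1  # hypothetical sentinel standing in for a masked token


def sample_masking_rate(rng: random.Random) -> float:
    """Draw the masking rate uniformly between 0.5 and 1."""
    return rng.uniform(0.5, 1.0)


def mask_tokens(tokens: List[int], rate: float, rng: random.Random) -> List[int]:
    """Mask each token independently with probability `rate`."""
    return [MASK_ID if rng.random() < rate else tok for tok in tokens]


rng = random.Random(0)
# Hypothetical input: 10,000 token ids drawn from a 1024-entry codebook.
tokens = [rng.randrange(1024) for _ in range(10_000)]
rate = sample_masking_rate(rng)
masked = mask_tokens(tokens, rate, rng)
masked_fraction = sum(tok == MASK_ID for tok in masked) / len(masked)
print(round(rate, 3), round(masked_fraction, 3))
```

With many tokens, the observed masked fraction stays close to the sampled rate, so each training example sees between roughly half and all of its input tokens hidden.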


2.2. Inference: Action-Controllable Video Generation


We now describe how to use Genie for action-
controllable video generation at inference time
(see Figure 8). A player first prompts the model
with an image $x_1$ that serves as the initial frame¹.
The image is tokenized using the video encoder,
yielding $z_1$. The player then specifies a discrete
latent action $a_1$ to take by choosing any integer
value within $[0, |A|)$². The dynamics model takes
the frame tokens $z_1$ and corresponding latent ac-
tion $\tilde{a}_1$, which is obtained by indexing into the VQ
codebook with the discrete input $a_1$, to predict
the next frame tokens $\hat{z}_2$. This process is repeated
to generate the rest of the sequence $\hat{z}_{2:T}$ in an au-
toregressive manner as actions continue to be
passed to the model, while tokens are decoded
into video frames $\hat{x}_{2:T}$ with the tokenizer's de-
coder. Note that we can regenerate ground truth
videos from the dataset by passing the model
the starting frame and inferred actions from the
video, or generate completely new videos (or tra-
jectories) by changing the actions.


¹The model can be conditioned on a varying number
of prompt frames. Here we start from one image as an
example.


Figure 8 | Genie Inference: the prompt frame is
tokenized, combined with the latent action taken
by the user, and passed to the dynamics model for
iterative generation. The predicted frame tokens
are then decoded back to image space via the
tokenizer's decoder.
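
The control flow of this inference procedure can be sketched as a simple loop. Everything model-specific below is a stand-in: `toy_dynamics`, the one-dimensional codebook, and the three-token "frames" are hypothetical placeholders for the trained dynamics model, the VQ codebook, and the tokenized prompt, and decoding tokens back to pixels is omitted.

```python
from typing import Callable, List, Sequence

Frame = List[int]          # token ids for one frame
Latent = List[float]       # one codebook embedding


def rollout(
    prompt_tokens: Frame,
    actions: Sequence[int],
    codebook: Sequence[Latent],
    dynamics: Callable[[List[Frame], List[Latent]], Frame],
) -> List[Frame]:
    """Autoregressively generate one new frame of tokens per user action."""
    frames: List[Frame] = [list(prompt_tokens)]
    latents: List[Latent] = []
    for a in actions:
        if not 0 <= a < len(codebook):
            raise ValueError(f"action {a} outside [0, {len(codebook)})")
        latents.append(list(codebook[a]))          # index into the VQ codebook
        frames.append(dynamics(frames, latents))   # predict next frame tokens
    return frames


def toy_dynamics(frames: List[Frame], latents: List[Latent]) -> Frame:
    """Placeholder dynamics: shift the last frame's tokens by the action value."""
    shift = int(latents[-1][0])
    return [(tok + shift) % 1024 for tok in frames[-1]]


toy_codebook = [[float(k)] for k in range(8)]  # 8 latent actions, as in |A| = 8
trajectory = rollout([0, 1, 2], [1, 3], toy_codebook, toy_dynamics)
print(trajectory)
```

Changing the action sequence changes the generated trajectory, while replaying actions inferred from a real video would correspond to the regeneration case mentioned above.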


3. Experimental Results 


Datasets We train Genie on a large-scale dataset 
collected from publicly available Internet videos 
of 2D Platformer games (referred to from here on 
as “Platformers”). We construct the Platformers
dataset by filtering publicly available videos for 
keywords relating to platformers, yielding 55M 
16s video clips at 10FPS, with 160x90 resolution. 
The final dataset contains 6.8M 16s video clips 
(30k hours), within an order of magnitude of 
other popular Internet video datasets (Bain et al., 
2021; Wang et al., 2023). More details can be 
found in Appendix B.1. Unless otherwise spec-
ified, results are with an 11B-parameter model
trained on this dataset.


²When first interacting with the model, it is unclear how
each latent action will impact the next frame generation.
However, we found that the meaning of each action remained
consistent across different inputs. Hence, interpreting the
mapping of latent actions is akin to learning the buttons on
a new controller.


Figure 9 | Scaling results. Left: Training curves for different model sizes, Middle: Final training loss
for each model size, averaged over the last 300 updates, Right: Final training loss for a 2.3B model
with different batch sizes.


To verify the generality of our method, we also 
consider the robotics datasets used to train RT1 
(Brohan et al., 2023), combining their dataset
of ~130k robot demonstrations with a separate 
dataset of simulation data and the 209k episodes 
of real robot data from prior work (Kalashnikov 
et al., 2018). Note that we do not use actions 
from any of these datasets, and simply treat them 
as videos. For simplicity, from here on we refer 
to this dataset as “Robotics”. 


Metrics We examine the video generation per- 
formance of Genie via two factors, namely video 
fidelity, i.e. the quality of video generation, and 
controllability, i.e. how much impact the latent 
actions have in video generation. For video fi-
delity we use the Fréchet Video Distance (FVD),
a video-level metric, which has been shown to
have a high level of alignment to human evalua-
tion on video quality (Unterthiner et al., 2019).
For controllability, we devise a metric based on
peak signal-to-noise ratio (PSNR) which we call
$\Delta_t$PSNR, that measures how much the video gen-
erations differ when conditioned on latent actions
inferred from ground-truth ($\hat{x}_t$) vs. sampled from
a random distribution ($\hat{x}'_t$):


$\Delta_t \mathrm{PSNR} = \mathrm{PSNR}(x_t, \hat{x}_t) - \mathrm{PSNR}(x_t, \hat{x}'_t)$,


where $x_t$ denotes the ground-truth frame at time
$t$, $\hat{x}_t$ denotes the frame from latent actions $\tilde{a}_{1:t}$
inferred from ground-truth frames, and $\hat{x}'_t$ the
same frame generated from a sequence of latent
actions randomly sampled from a categorical dis-
tribution. As such, the greater $\Delta_t$PSNR is, the
more the video generated from random latent ac-
tions differs from ground-truth, which indicates
a higher level of controllability from the latent
actions. For all experiments we report $\Delta_t$PSNR
with $t = 4$.
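
A minimal sketch of the controllability metric follows, using the standard PSNR definition over flattened pixel lists. The 8-bit pixel range (`max_val = 255`) and the four-pixel toy frames are assumptions made for illustration; in practice the two generated frames would come from the model conditioned on inferred and on randomly sampled latent actions.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], image: Sequence[float], max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    mse = sum((r - g) ** 2 for r, g in zip(reference, image)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


def delta_psnr(
    ground_truth: Sequence[float],
    from_inferred_actions: Sequence[float],
    from_random_actions: Sequence[float],
    max_val: float = 255.0,
) -> float:
    """PSNR with inferred actions minus PSNR with random actions."""
    return (psnr(ground_truth, from_inferred_actions, max_val)
            - psnr(ground_truth, from_random_actions, max_val))


# Toy frames (hypothetical pixel values): MSE is 1 for the first
# generation and 100 for the second.
x_t = [100, 120, 140, 160]
x_hat = [101, 119, 141, 159]
x_hat_prime = [110, 110, 150, 150]
print(delta_psnr(x_t, x_hat, x_hat_prime))
```

Because the two mean squared errors differ by a factor of 100, the toy example yields a difference of 20 dB; a value near zero would instead indicate that the latent actions have little influence on the generated frame.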


Training Details Our video tokenizer uses 
200M parameters, a patch size of 4 and a code- 
book with embedding size 32 and 1024 unique 
codes, which we found to be the most effective 
given the trade-off between reconstruction qual- 
ity of the tokenizer and downstream performance 
of video prediction. The latent action model has 
300M parameters, a patch size of 16, and a code- 
book with embedding size 32 and 8 unique codes 
(latent actions). For all modelling components 
we use a sequence length of 16 frames with an 
FPS of 10. Further, we employ bfloat16 and QK 
norm for training our dynamics model, which has 
been shown to stabilize training at large scale 
(Dehghani et al., 2023; Henry et al., 2020). At 
inference time, we perform 25 MaskGIT steps for 
the sampling of each frame with a temperature 
of 2 using random sampling. See Appendix C for 
more details. 


3.1. Scaling Results 


In this section, we investigate the scaling behavior 
of our model. To this end, we conduct studies 
that explore the impact of both model size and 
batch size. See Appendix D for more details on 
architecture and compute usage. 


Scaling Model Size Given a fixed video tok- 
enizer and action model architecture, we train a 
series of dynamics models ranging from 40M to 
2.7B parameters. Figure 9 shows our architecture 
scales gracefully with model parameters, with
each increase in size corresponding to a consis-
tent decrease in the final training loss. This is a
strong indication that our approach benefits from
scaling, which we exploit with our main Genie
model.


Figure 10 | Playing from Image Prompts: We can prompt Genie with images generated by text-to-
image models, hand-drawn sketches or real-world photos. In each case we show the prompt frame
and a second frame after taking one of the latent actions four consecutive times. In each case we see
clear character movement, despite some of the images being visually distinct from the dataset.


Scaling Batch Size We also investigate the ef- 
fect of scaling the batch size, considering a 2.3B 
model with batch sizes of 128, 256, and 448, 
equating to 1.9M, 3.8M and 6.6M tokens. As 
shown in Figure 9, increasing the batch size leads 
to a similarly favorable gain in terms of model 
performance. 


Genie Model It is clear that increasing both 
model size and batch size helps improve model 
performance. As a result, for our final model, we 
train a 10.1B dynamics model with a batch size of 
512, for a total of 125k steps, using 256 TPUv5p. 
When combined with the tokenizer and action 
model this brings the total to 10.7B parameters, 
trained on 942B tokens, which we refer to as the 
Genie model. For our website, we train a larger 
decoder mapping tokens to 360p videos, which adds
further parameters.
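
As a rough consistency check on the reported training scale, the sketch below derives the implied number of tokens per training sequence from the batch-size figures and from the final run. All inputs are the rounded values quoted in the text, so only approximate agreement should be expected; the variable names are illustrative.

```python
# Reported (rounded) tokens per batch for the batch-size study.
tokens_per_batch = {128: 1.9e6, 256: 3.8e6, 448: 6.6e6}

# Implied tokens per sequence for each batch size.
tokens_per_sequence = {b: t / b for b, t in tokens_per_batch.items()}

# Final model: batch size 512, 125k steps, 942B tokens in total.
final_sequences = 512 * 125_000
implied_tokens_per_sequence = 942e9 / final_sequences

for batch, per_seq in tokens_per_sequence.items():
    print(batch, round(per_seq))
print("final run", round(implied_tokens_per_sequence))
```

Both sets of figures point to sequences on the order of $10^4$ tokens, in line with the token count that motivates the memory-efficient ST-transformer.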


3.2. Qualitative Results 


We now present qualitative results from the Ge-
nie model. We showcase an 11B parameter model
trained on the Platformers dataset and a smaller 
model trained on the Robotics dataset. Our model 
generates high-quality, controllable videos across 
diverse domains. Notably, we qualitatively eval- 
uate our Platformers-trained model using only 
out-of-distribution (OOD) image prompts, includ- 
ing those generated from text-to-image models, 
hand-drawn sketches, and even realistic photos. 
The ability to generalize to such significantly 
OOD inputs underscores the robustness of our 
approach and the value of training on large-scale 
data, which would not have been feasible with 
real actions as input. 


Platformers-trained model Figure 10 show- 
cases examples of our model's generations 
prompted from OOD images, including (top 
row) images generated from Imagen2 (Ho et al., 
2022a; van den Oord et al.), (second row) hand- 
drawn sketches and (bottom row) real-world pho- 
tos. Genie is able to bring these imagined worlds 
to life, as we see game-like behaviour when in-
teracting with each example. We showcase more
generations by our model in Appendix A, addi-
tionally highlighting the consistency of the latent
actions.


Figure 11 | Learning to simulate deformable objects: we show frames from a ten step trajectory in
the model, taking the same action. Genie is capable of learning the physical properties of objects such
as bags of chips.


Figure 12 | Emulating parallax, a common fea- 
ture in platformer games. From this initial text- 
generated image, the foreground moves more 
than the near and far middle ground, while the 
background moves only slightly. 


Another emergent capability of our model is 
its ability to understand 3D scenes and emulate 
parallax, which is commonly seen in platformer 
games. In Figure 12 we show an image generated 
by Imagen2, where taking a latent action moves 
the foreground at a different rate to the back- 
ground (as indicated by the length of different 
colored arrows). 


Robotics-trained model We trained a 2.5B- 
parameter model on the Robotics dataset using 
the same hyperparameters found to be best on 
Platformers, achieving an FVD of 82.7 on the test 
split. As shown in Figure 13, this model success- 
fully learns distinct and consistent actions from 
video data, requiring neither text nor action labels 
(as in e.g. Yang et al. (2023)). Notably, our model 
learns not only the controls of the robotic arm but 
also the interactions and deformations of various 
objects (Figure 11). We believe this shows our 
approach presents a path to using larger video
datasets from the Internet to create a founda-
tional world model for robotics, with low-level
controllable simulation that could be used for a
variety of applications.


Figure 13 | Controllable, consistent latent ac-
tions in Robotics: trajectories beginning from
three different starting frames from our Robotics
dataset. Each column shows the resulting frame
from taking the same latent action five times. De-
spite training without action labels, the same ac-
tions are consistent across varied prompt frames
and have semantic meaning: down, up and left.


3.3. Training Agents 


We believe Genie could one day be used as a foun- 
dation world model for training generalist agents. 
In Figure 14 we show that the model can already 
be used for generating diverse trajectories in un- 
seen RL environments given starting frames. We 
further investigate if latent actions learnt from 
Internet videos can be used for imitating behav-
iors from unseen videos. We use a frozen LAM
to label a sequence of expert videos from a tar- 
get environment with discrete latent actions and 
then train a policy that predicts the likelihood of 
the expert taking a latent action given an obser- 
vation. We then use a small dataset with expert 
ground-truth actions for mapping latent to real 
actions (see Appendix E for more details). 


Figure 14 | Playing from RL environments: Ge- 
nie can generate diverse trajectories given an im- 
age of an unseen RL environment. 


We evaluate in both hard and easy settings of 
a procedurally generated 2D-platformer environ- 
ment, CoinRun (Cobbe et al., 2020), and com- 
pare against an oracle behavioral cloning (BC) 
model that has access to expert actions as an up- 
per bound, and a random agent as a lower bound 
(Figure 15). The LAM-based policy achieves the 
same score as the oracle given as few as 200 ex- 
pert samples to adapt, despite almost certainly 
never seeing CoinRun before. This provides evi- 
dence that the learnt latent actions are consistent 
and meaningful for transfer, as the mapping from 
latent to real contains no information about the 
current observation. 
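
To illustrate why such a mapping carries no information about the current observation, the sketch below builds a fixed lookup table from latent to real actions using a small set of labelled expert samples. The majority-vote rule, the function name, and the action names are assumptions made for illustration; the procedure actually used is described in Appendix E and may differ.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def fit_latent_to_real(
    pairs: Iterable[Tuple[int, Hashable]]
) -> Dict[int, Hashable]:
    """Map each latent action to the real action it most often co-occurs with."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for latent, real in pairs:
        counts[latent][real] += 1
    return {latent: c.most_common(1)[0][0] for latent, c in counts.items()}


# Hypothetical expert samples: (latent action from the frozen LAM, true action).
expert_pairs = [(0, "right"), (0, "right"), (0, "jump"), (3, "left"), (3, "left")]
mapping = fit_latent_to_real(expert_pairs)
print(mapping)
```

A policy that predicts latent actions can then act in the real environment by looking up each predicted latent in this table, independently of the frame it was predicted from.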


3.4. Ablation Studies 


Design choices for latent action model In de- 
signing our latent action model, we carefully con- 
sidered the type of input to use. While we ulti- 
mately chose to use the original images (pixels), 
we evaluated this choice against the alternative 
of using tokenized images (replacing x with z in 
Figure 5). We refer to this alternative approach 
as the “token-input” model (see Table 2).


While this model achieved a slightly lower FVD 
score on the Platformers dataset, it did not main-
tain this advantage on the Robotics dataset. More
importantly, in both environments, the token-
input model exhibited worse controllability (as
measured by $\Delta_t$PSNR). This suggests that some
information about video dynamics and movement
might have been lost during tokenization, and as
a result it is beneficial for the latent action model
to take in raw videos as input.


Figure 15 | BC results. Mean percentage of levels
solved out of 100 samples, averaged over 5 seeds
with 95% confidence intervals.


Table 2 | Latent action model input ablation. 
We see that Genie achieves higher controllability. 


                     Dataset      #Params  FVD (↓)  $\Delta_t$PSNR (↑)
Token-input          Platformers  2.3B     38.8     1.33
Pixel-input (Genie)  Platformers  2.5B     40.1     1.91
Token-input          Robotics     1B       257.8    1.65
Pixel-input (Genie)  Robotics     1B       136.4    2.07


Tokenizer architecture ablations We com-
pare the performance of three choices of tokeniz-
ers, including 1) (spatial-only) ViT, 2) (spatial-
temporal) ST-ViViT and 3) (spatial-temporal) C-
ViViT (Table 3). For comparison we use a similar
number of parameters for all tokenizers, with
patch size 10, batch size 128 and sequence length
16. We then train the same dynamics and latent
action model on these three different tokenizers,
and report their FVD as well as $\Delta_t$PSNR.


Table 3 | Tokenizer architecture ablation: Our 
ST-ViViT architecture results in the best perform- 
ing tokenizer. 


                                 #Params  Memory  FVD (↓)  $\Delta_t$PSNR (↑)
ViT                              230M     0.3GB   114.5    1.39
C-ViViT (Villegas et al., 2023)  225M     1.6GB   272.7    1.37
ST-ViViT (ours)                  205M     0.9GB   81.4     1.66
Our proposed ST-ViViT architecture provides
both improved video generation (FVD) and
$\Delta_t$PSNR, for a reasonable trade-off in memory, as
compared to C-ViViT and the spatial-only ViT.
This demonstrates its ability to generate videos 
of high fidelity and controllability, respectively. 
While C-ViViT employs a full space-time atten- 
tion mechanism, resulting in significantly higher 
memory consumption compared to the other two 
architectures at the same parameter count, this 
does not translate to improved performance. In 
fact, C-ViViT exhibits a tendency towards overfit- 
ting, necessitating strong regularization during 
training, which might explain its considerably 
lower performance. 


4. Related Work 


World models Generative interactive environ- 
ments can be considered a class of World Models 
(Ha and Schmidhuber, 2018; Oh et al., 2015), 
which enable next-frame prediction that is con- 
ditioned on action inputs (Bamford and Lucas, 
2020; Chiappa et al., 2017; Eslami et al., 2018; 
Hafner et al., 2020, 2021; Kim et al., 2020, 2021; 
Micheli et al., 2023; Nunes et al., 2020; Pan 
et al., 2022; Robine et al., 2023). Such mod- 
els can be useful for training agents, as they can 
be used for learning policies without direct envi- 
ronment experience at agent training time. How- 
ever, learning the models themselves typically re- 
quires action-conditioned data obtained directly 
from the environment. In contrast, our approach 
seeks to learn a world model in an unsupervised 
fashion from videos alone. Recently, there has 
been renewed emphasis on scaling world models. 
GAIA-1 (Hu et al., 2023) and UniSim (Yang et al., 
2023) learn world models for autonomous driv- 
ing and robotic manipulation respectively. These 
approaches require both text and action labels, 
while we focus on training from video-only data 
from publicly available Internet videos. 


Video models Our work is related to video mod- 
els, which typically condition on initial frames (or 
text) and predict the remaining frames in a video 
(Blattmann et al., 2023b; Brooks et al., 2024; 
Clark et al., 2019; Finn et al., 2016; Ho et al., 
2022a,b; Hoppe et al., 2022; Kalchbrenner et al., 
2017; Le Moing et al., 2021; Lotter et al., 2017;
Luc et al., 2020; Singer et al., 2023; Walker et al., 
2021; Yan et al., 2021; Yu et al., 2023). Our ap- 
proach most resembles recent transformer based 
models such as Phenaki (Villegas et al., 2023), 
TECO (Yan et al., 2023) and MaskViT (Gupta 
et al., 2023), as we use MaskGIT (Chang et al., 
2022) and an ST-Transformer (Xu et al., 2020) 
over tokenized images. While video models are 
becoming increasingly controllable (e.g. Huang
et al., 2022), we seek a more agentic goal and
explicitly learn a latent action space from data, al- 
lowing users or agents to “play” the model using 
latent action-conditioned predictions. 


Playable Video Generation Genie generalizes 
beyond Playable Video Generation (PVG) (Mena- 
pace et al., 2021), where latent actions are used 
for controlling world models learnt directly from 
videos (Menapace et al., 2021, 2022). In contrast 
to Genie, PVG considers domain-specific static 
examples, rather than generating entirely new 
environments via prompting. Thus, scaling be- 
yond this setting required non-trivial architectural 
changes, dropping inductive biases in exchange 
for a general method. 


Environment generation Our work is also re- 
lated to Procedural Content Generation (PCG, e.g. 
Risi and Togelius, 2020a,b) where machine learn- 
ing has proven highly effective for generating 
game levels (Summerville et al., 2018), recently 
via language models that directly write game 
code (Sudhakaran et al., 2023; Todd et al., 2023). 
Language models themselves can also be consid- 
ered to be interactive environments (Wong et al., 
2023), albeit lacking a visual component. By contrast, 
in our setting the levels can be learnt and 
generated directly from pixels, which enables us 
to utilize the diversity of Internet video data. 


Training agents with latent actions Prior 
works have used latent actions for imitation from 
observation (Edwards et al., 2019), planning (Ry- 
bkin* et al., 2019) and pre-training RL agents 
(Schmidt and Jiang, 2024; Ye et al., 2022). These 
approaches have similar objectives to our latent 
action model, though have not been applied at 
scale. VPT (Baker et al., 2022) is a recent approach 
that uses an inverse dynamics model 
learnt from human-provided action labeled data, 
to label Internet-scale videos with actions that can 
then be used for training a policy. We showed, 
in contrast, that we can use latent actions learnt 
from Internet videos to infer policies for arbitrary 
environments, avoiding the need for ground-truth 
actions that are costly and may not generalize. 


5. Conclusion and Future Work 


We proposed Genie, a new form of generative AI 
that enables anyone, even children, to dream up, 
create, and step into generated worlds as we can 
with human-designed simulated environments. 
Genie can be prompted to generate a diverse set of 
interactive and controllable environments despite 
training from video-only data. 


There are clear improvements that can be made 
to the model. Genie inherits some of the weak- 
nesses of other autoregressive transformer mod- 
els, and can hallucinate unrealistic futures. And 
while we have made progress with spatiotem- 
poral representations, we are still limited to 16 
frames of memory which makes it challenging 
to get consistent environments over long horizons. 
Finally, Genie currently operates at around 
1 FPS and requires future advances to achieve an 
efficient frame rate for interaction. 


Still, we believe Genie opens up vast poten- 
tial for future research. Given its generality, the 
model could be trained from an even larger pro- 
portion of Internet videos to simulate diverse, re- 
alistic, and imagined environments. Furthermore, 
we only briefly touched upon the capabilities of 
using Genie for training agents, but given that 
the lack of rich and diverse environments is one 
of the key limitations in RL, we could unlock new 
paths to creating more generally capable agents. 


Broader Impact 


Societal Impact Genie could enable a large 
number of people to generate their own game-like 
experiences. This could be positive for those who 
wish to express their creativity in a new way, for 
example children who could design and step into 
their own imagined worlds. We also recognize 
that with significant advances, it will be critical to 
explore the possibilities of using this technology 
to amplify existing human game generation and 
creativity, and of empowering relevant industries 
to utilize Genie to enable their next generation of 
playable world development. 


Training Data and Weights: We have chosen 
not to release the trained model checkpoints, the 
model’s training dataset, or examples from that 
data to accompany this paper or the website. We 
would like to have the opportunity to further en- 
gage with the research (and video game) commu- 
nity and to ensure that any future such releases 
are respectful, safe and responsible. 


Reproducibility: We understand that it may 
be challenging for researchers with fewer computational 
resources to reproduce our main results. In order 
to mitigate this issue, we describe a smaller scale, 
fully reproducible example in Appendix F that can 
run on a single mid-range TPU (or GPU). Given 
that many design choices translate between the 
two settings, we believe this will make it possible 
for the broader community to investigate future 
architectural improvements as well as additional 
research directions resulting from our work. 
A. Additional Example Trajectories 


Figure 16 | More example trajectories: the model is prompted with either hand-drawn sketches, 
images generated from text-to-image generative models or realistic photos. Actions that drive the 
dynamics of the trajectory are provided by human input. 


Figure 17 | Controllable, consistent latent actions in Platformers: trajectories beginning from four 
different starting frames from our Platformers dataset. Each column shows the resulting frame from 
taking the same latent action five times. Despite training without action labels, the same actions are 
not only consistent across varied prompt frames but also have semantic meaning: left, right, jump, 
and no-op. 


B. Dataset 


B.1. Platformers Dataset 


Initial Dataset We generated a dataset by filtering publicly available Internet videos, using the 
following criteria: 


* The title contains keywords relating to 2D platformer games. 
* The title or description must contain an action word, such as “speedrun” or “playthrough”. 
* The title must not contain negating words such as “movie” or “unboxing”. 


We then split each video into 16s clips at 10 FPS, which corresponds to 160 frames per clip. Our 
resulting dataset contains 55M videos, which totals around 244k hours. When selecting keywords, 
we manually spot checked results to check that they typically produced 2D platformer gameplay 
videos which are not outnumbered by other sorts of videos which happen to share similar keywords. 
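
As a quick sanity check, the clip and dataset sizes quoted above follow from simple arithmetic. The sketch below uses only the figures stated in the text (16 s clips, 10 FPS, 55M clips); the variable names are illustrative.

```python
# Sanity check of the dataset figures quoted in the text.
CLIP_SECONDS = 16        # clip length stated in the text
FPS = 10                 # frame rate stated in the text
NUM_CLIPS = 55_000_000   # "55M videos"

frames_per_clip = CLIP_SECONDS * FPS
total_hours = NUM_CLIPS * CLIP_SECONDS / 3600

print(frames_per_clip)     # 160
print(round(total_hours))  # 244444, i.e. around 244k hours
```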


Filter Pipeline We noticed that many of the videos in the dataset were of poor quality, which hurt 
model performance. We propose a scalable approach to systematically filter the data, using a 
learned classifier as in Baker et al. (2022). First, we define high quality videos as those that display 
clear gameplay and do not contain distractor items such as menu screens or streamer faces. We then 
filter this data as follows: 


1. Our team hand labelled 10k videos, with roughly ten hours of total human effort. The labels 
ranged from 5 (best) to 1 (worst) quality. 

2. We trained an 11M parameter ResNet18 (He et al., 2016) as a binary classifier: we 
discarded all entries rated 2-4 and labelled those rated 5 as good and those rated 1 as bad. 

3. We then apply a decision rule based on model prediction and confidence to determine whether 
to keep the video. 
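
The labelling and filtering steps above can be sketched roughly as follows. This is only an illustration: the text does not specify the decision rule, so the probability threshold of 0.9 and the function names `binarize_label` and `keep_video` are assumptions, and the classifier itself is not shown.

```python
from typing import Optional


def binarize_label(rating: int) -> Optional[int]:
    """Map a 1-5 quality rating to a binary training target.

    5 -> 1 (good), 1 -> 0 (bad); ratings 2-4 are dropped (None).
    """
    if rating == 5:
        return 1
    if rating == 1:
        return 0
    return None


def keep_video(p_good: float, threshold: float = 0.9) -> bool:
    """Hypothetical decision rule: keep a clip only if the classifier's
    predicted probability of 'good' is at least `threshold`."""
    return p_good >= threshold


ratings = [5, 3, 1, 4, 5, 2]
targets = [t for t in map(binarize_label, ratings) if t is not None]
print(targets)  # [1, 0, 1]
```

A stricter threshold trades dataset size for quality, which is the trade-off Table 4 evaluates.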


Consistent with findings in prior work (Baker et al., 2022; Oquab et al., 2023), having high quality 
data outweighs the quantity of data: even though the curated dataset is only just over 10% the size 
of the original dataset, the model trained on the curated dataset achieves a better (lower) FVD, see 
Table 4. Our final dataset is 6.8M videos for a total of over 30k hours. 


Table 4 | Effect of dataset curation. 


                                 #Params   FVD (↓) 

Original dataset (55M videos)    580M      61.4 
Curated dataset (6.8M videos)    580M      54.8 


C. Training details 


C.1. Latent Action Model Training 


We found a benefit from increasing the number of codes (i.e. number of actions), at the cost of 
reduced playability for human and AI agents. 


Table 5 | Platformers action model hyperparameters 


Component   Parameter    Value 

Encoder     num_layers   20 
            d_model      1024 
            num_heads    16 

Decoder     num_layers   20 
            d_model      1024 
            num_heads    16 

Codebook    num_codes    8 
            patch_size   16 
            latent_dim   32 


Note that the model inputs are normalized between 0 and 1 and the final outputs of the decoder 
are passed through a sigmoid. 


C.2. Video Tokenizer Training 


Here we describe our video tokenizer training. We found it more effective to scale our decoder than 
the encoder, and a marginal gain from increasing batch size (see Table 6). 
Table 6 | Tokenizer batch size scaling hyperparameters. 


batch_size   training hardware   FLOPs          PSNR 

64           64 TPUv2            4.22 × 10^20   35.7 
384          64 TPUv3            2.57 × 10^21   36.5 


Table 7 | Platformers video tokenizer hyperparameters. 


Component   Parameter    Value 

Encoder     num_layers   12 
            d_model      512 
            num_heads    8 
            k/q_size     64 

Decoder     num_layers   20 
            d_model      1024 
            num_heads    16 
            k/q_size     64 

Codebook    num_codes    1024 
            patch_size   4 
            latent_dim   32 


We train our video tokenizer for 300k steps using the AdamW optimizer, with cosine decay, using 
the hyperparameters in Table 8. 


Table 8 | Video tokenizer optimizer hyperparameters 


Parameter      Value 

max_lr         3e-4 
min_lr         3e-4 
β1             0.9 
β2             0.9 
weight_decay   1e-4 
warmup_steps   10k 
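
To make the role of max lr, min lr and warmup steps concrete, here is a minimal sketch of a warmup plus cosine decay schedule. The text only states that cosine decay is used; the linear warmup shape and the function name `lr_at` are assumptions made for illustration.

```python
import math


def lr_at(step, max_lr, min_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear warmup to max_lr (assumed shape),
    then cosine decay from max_lr down to min_lr at total_steps."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Values from Table 8 (video tokenizer, 300k steps).
tokenizer = dict(max_lr=3e-4, min_lr=3e-4, warmup_steps=10_000, total_steps=300_000)
print(lr_at(5_000, **tokenizer))    # halfway through warmup
print(lr_at(300_000, **tokenizer))  # end of training
```

With the Table 8 values as printed, max lr and min lr coincide, so under this sketch the learning rate is constant after warmup; the cosine term only matters when min lr is lower than max lr.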


C.3. Dynamics Model Training 


D. Scaling Experiments Details 


In this section we provide more details on the architecture as well as compute budget for the scaling 
experiments. 


Scaling model size For all models we use a batch size of 256. We train all models for 200k steps, 
thus use a total of 750B training tokens for each run. All runs make use of batch parallelism and 
stage-3 ZeRO sharding (Rajbhandari et al., 2020), while our larger models also make use of tensor 
parallelism (Shoeybi et al., 2019). For this experiment we make use of TPUv2 and TPUv3 (Jouppi 
et al., 2020). See Table 10 for more details. 
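
The token budget above also implies the approximate number of tokens in each training sequence. The sketch below derives it from the stated batch size, step count and token total; since 750B is a rounded figure, the result is approximate.

```python
# Implied tokens per training sequence, from the figures in the text.
BATCH_SIZE = 256
STEPS = 200_000
TOTAL_TOKENS = 750e9  # "750B training tokens" (rounded in the text)

sequences_seen = BATCH_SIZE * STEPS
tokens_per_sequence = TOTAL_TOKENS / sequences_seen

print(sequences_seen)              # 51200000
print(round(tokens_per_sequence))  # 14648
```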
Table 9 | Dynamics model optimizer hyperparameters 


Parameter      Value 

max_lr         3e-5 
min_lr         3e-6 
β1             0.9 
β2             0.9 
weight_decay   1e-4 
warmup_steps   5k 


Table 10 | Model size scaling architectures and compute usage. All models were trained for 200k 
steps with a batch size of 256, equating to 750B tokens. 


Parameters   num_layers   num_heads   d_model   k/q_size   training hardware   training time   FLOPs 

41M          18           8           512       64         64 TPUv2            3 days          2.05 × 10^20 
96M          16           16          768       64         64 TPUv2            6 days          3.58 × 10^20 
192M         20           18          1024      64         64 TPUv2            9 days          6.4 × 10^20 
404M         21           12          1536      128        64 TPUv2            18 days         1.2 × 10^21 
811M         20           20          2048      128        128 TPUv3           7 days          2.2 × 10^21 
1.6B         28           22          2560      128        128 TPUv3           12 days         4.04 × 10^21 
2.7B         36           22          3072      128        256 TPUv3           16 days         6.91 × 10^21 


Scaling batch size All models use the same architecture with 2.3B parameters, as shown in Table 11, 
and train for 200k steps. Apart from batch size, the only difference between the three runs is hardware: the 128, 256 and 
448 batch size models train on 64 TPUv3, 128 TPUv3 and 64 TPUv5p respectively. 


Table 11 | Batch size scaling hyperparameters. All models use the following architecture and are trained for 200k 
steps, differing only in batch size. 


Parameters   num_layers   num_heads   d_model   k/q_size 

2.3B         34           20          2560      128 


Genie Model The parameter count, model architecture and compute usage of the dynamics 
model for the final Genie model are listed in Table 12. We train a 10.1B dynamics model with a batch 
size of 512, for a total of 125k steps using 256 TPUv5. 


E. Behavioral Cloning Details 


In this section we provide more details about our behavioral cloning experiments. We train within the 
Procgen CoinRun environment (Cobbe et al., 2020) and evaluate on a held-out test set. We assume we 
have a dataset of expert sequences in this environment from an agent trained with R2D2 (Kapturowski 
et al., 2018). We then train an agent to imitate from this data. Notably, the oracle agent has access 
to the corresponding ground-truth expert actions. We now discuss how we can utilize a pre-trained 
LAM to infer the actions taken. 


E.1. Genie LAM 


In order to train an agent to imitate from unseen videos, we can use a frozen LAM from a Genie 
model trained on Internet videos. Given an expert sequence $(x_t, x_{t+1})$ we extract the corresponding 
latent action label $a_t \leftarrow \text{LAM}(x_t, x_{t+1})$. We then train a policy $\pi(a_t | x_t)$ to predict the likelihood of 
the expert taking latent action $a_t$ given observation $x_t$. Note that this procedure is similar to prior 
works that learn from videos (Baker et al., 2022; Torabi et al., 2018). However, these approaches use 
ground-truth actions for labeling videos whereas we utilize latent actions learnt completely offline. 


Table 12 | Genie dynamics model hyperparameters. 


Parameters   num_layers   num_heads   d_model   k/q_size   FLOPs 

10.1B        48           36          5120      128        6.6 × 10^22 


During inference, we must map latent actions emitted by the policy to real actions. To do this, 
we utilize a small set of action-labeled expert sequences. Given an expert sequence $(x_t, u_t, x_{t+1})$ (we 
denote $u_t$ for ground-truth actions to avoid confusion with predicted latent actions), we use the LAM to 
obtain a latent action $a_t$ and fill a dictionary $D$ that maps each latent action to a list of corresponding 
real actions. In summary, given an observation $x_t$ from the environment, we can obtain the most 
likely latent action as $a_t \sim \pi(x_t)$, and then take the corresponding real action as $u_t \sim D[a_t]$. 
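
The dictionary-based mapping from latent to real actions can be sketched as below. This is a toy illustration under stated assumptions: the latent ids would in practice come from the frozen LAM applied to consecutive frames, whereas here they are hard-coded, and the helper names `build_action_map` and `to_real_action` are hypothetical.

```python
import random
from collections import defaultdict


def build_action_map(latent_actions, real_actions):
    """Fill D: latent action id -> list of real actions observed with it."""
    mapping = defaultdict(list)
    for a, u in zip(latent_actions, real_actions):
        mapping[a].append(u)
    return mapping


def to_real_action(mapping, latent_action, rng=random):
    """Sample a real action from D[latent_action]; real actions that were
    paired with this latent more often are sampled more often."""
    candidates = mapping.get(latent_action)
    if not candidates:
        raise KeyError(f"no labelled example for latent action {latent_action}")
    return rng.choice(candidates)


# Toy stand-ins for LAM outputs and ground-truth expert actions.
latents = [0, 0, 1, 2, 1, 0]
reals = ["left", "left", "right", "jump", "right", "noop"]
D = build_action_map(latents, reals)
print(D[1])                  # ['right', 'right']
print(to_real_action(D, 2))  # jump
```

Keeping a list rather than a single action per latent lets noisy pairings (such as the "noop" under latent 0 above) be outvoted by the more frequent real action.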


Note that other works have used data extracted from the agent's policy to obtain a mapping from 
latent to real actions (Edwards et al., 2019; Ye et al., 2022), but we found using expert data enabled 
us to better evaluate the quality of the learnt policy. As shown in the main text, the agent was capable 
of adapting with as few as 200 expert labels. 


E.2. Architecture 


We train a transformer as the policy for both the oracle and latent BC agents. We utilize our proposed 
ST-ViViT architecture for encoding the frames $x_{1:t} = (x_1, \cdots, x_t)$. All previous actions are 
one-hot encoded and then combined with the corresponding frame encoding as an additive embedding. We 
use a sequence length of 4 during both training and inference and a batch size of 16. 


Table 13 | BC model optimizer hyperparameters 


Parameter      Value 

max_lr         3e-5 
min_lr         3e-6 
β1             0.9 
β2             0.96 
weight_decay   1e-4 
warmup_steps   5k 


Table 14 | BC policy hyperparameters 


Component   Parameter      Value 

Encoder     num_layers     12 
            d_model        512 
            patch_size     4 

Policy      linear_layer   512 


Both the oracle and Genie LAM agents are trained with a cross-entropy loss where targets are either real 
or latent actions, respectively. During inference, we obtain the final prediction by sampling from 
the predicted logits. Note that we found the oracle agent performed better when we randomly sampled 
actions 10% of the time. 
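
A minimal sketch of this inference-time sampling is given below. It assumes that "randomly sampled actions" means drawing uniformly over the action set with probability `epsilon` (0.1 for the oracle agent), which the text does not state explicitly; the function name `sample_action` is hypothetical.

```python
import math
import random


def sample_action(logits, epsilon=0.0, rng=random):
    """Sample an action index from softmax(logits). With probability
    `epsilon`, return a uniformly random action instead (assumed rule)."""
    if rng.random() < epsilon:
        return rng.randrange(len(logits))
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 0.5, -1.0]
actions = [sample_action(logits, epsilon=0.1, rng=rng) for _ in range(5)]
print(actions)  # five action indices, each in {0, 1, 2}
```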


F. Reproducible Case Study 


In this section we describe a self-contained, fully reproducible case study that can be trained with a 
single mid-range TPU/GPU in under a week. 


F.1. Data Collection 


First we need to collect the data to train our model. We use the CoinRun environment from the 
Procgen benchmark (Cobbe et al., 2020) since it has thousands of visually diverse levels with fairly 
simple platformer-like dynamics. Using the “hard” mode, we collect data using a random policy with 
no action repeats. We sample level seeds between zero and 10,000 and collect 1,000 timesteps for 
each level, for a total of 10M transitions. 
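
The collection loop can be sketched as follows. To keep the example self-contained it uses a hypothetical `StubEnv` in place of the real CoinRun environment, and it interprets "no action repeats" as drawing a fresh random action at every step; both are assumptions. At full scale, 10,000 level seeds with 1,000 timesteps each give the 10M transitions quoted above.

```python
import random


class StubEnv:
    """Hypothetical stand-in for one CoinRun level; a real run would wrap
    the Procgen environment instead (not shown here)."""

    num_actions = 15  # Procgen uses a 15-way discrete action space

    def __init__(self, seed):
        self.seed, self.t = seed, 0

    def reset(self):
        self.t = 0
        return (self.seed, self.t)  # placeholder observation

    def step(self, action):
        self.t += 1
        return (self.seed, self.t), False  # (observation, done)


def collect(make_env, level_seeds, steps_per_level, rng):
    """Roll out a uniform random policy, one fresh action per step."""
    transitions = []
    for seed in level_seeds:
        env = make_env(seed)
        obs = env.reset()
        for _ in range(steps_per_level):
            action = rng.randrange(env.num_actions)
            next_obs, done = env.step(action)
            transitions.append((obs, action, next_obs))
            obs = env.reset() if done else next_obs
    return transitions


data = collect(StubEnv, level_seeds=range(3), steps_per_level=5, rng=random.Random(0))
print(len(data))  # 15
```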


F.2. Video Tokenizer Training 


Our video tokenizer for CoinRun follows the same setup as described in Section 2.1, trained with the 
optimizer configuration as in Section C.2. The primary difference in this example is we use smaller 
model sizes (see Table 15), and then use a batch size of 48 sequences, of length 16, for a total of 768 
images per batch. This is sufficient to fit in a single TPU with 16G memory. The model is trained for 
three days using a single TPU which is sufficient to complete 300k steps. 


Table 15 | CoinRun video tokenizer hyperparameters 


Component   Parameter    Value 

Encoder     num_layers   8 
            d_model      512 
            num_heads    8 

Decoder     num_layers   8 
            d_model      512 
            num_heads    8 

Codebook    num_codes    1024 
            patch_size   4 
            latent_dim   32 


F.3. Dynamics + Latent Action Model Training 


Once we have trained the video tokenizer we can then jointly train the latent action and dynamics 
models. Once again we seek to fit our model training inside 16G memory, so we use a batch size of 
36 sequences consisting of 16 frames each, for a total of 576 images. We train both the latent action 
model and dynamics model in parallel, using the setup described above (see: Section C.1 for the 
latent action model and Section C.3 for the dynamics model). 


Both models are trained for 200k steps, using the optimizer 
hyperparameters in Table 9. We find that the resulting model produces consistent, playable latent actions that resemble 
those of the original environment. 
Table 16 | CoinRun action model hyperparameters 


Component   Parameter    Value 

Encoder     num_layers   8 
            d_model      512 
            num_heads    8 

Decoder     num_layers   8 
            d_model      512 
            num_heads    8 

Codebook    num_codes    6 
            latent_dim   32 


Table 17 | CoinRun dynamics model hyperparameters 


Component      Parameter       Value 

Architecture   num_layers      12 
               d_model         512 
               num_heads       8 

Sampling       temperature     1.0 
               maskgit_steps   25 